Let's first load the modules we need.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import sqlalchemy as sqla
import sqlite3
%matplotlib inline
plt.style.use('ggplot')
First, let's read the vehicle data. Because the file is so large, we can't upload it as a single CSV due to GitHub's file-size limit. Instead, we split the dataset into three CSVs, read them one by one, and then use pandas.concat to combine them into one dataframe.
vehicle1 = pd.read_csv('vehicle1.csv')
vehicle2 = pd.read_csv('vehicle2.csv')
vehicle3 = pd.read_csv('vehicle3.csv')
Combine them into one giant dataframe.
vehicle = pd.concat([vehicle1, vehicle2, vehicle3], ignore_index=True)
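If the number of split files ever changes, the reads can be generalized with a glob pattern instead of hard-coded names. A small sketch (the helper name `read_splits` and the `vehicle*.csv` pattern are illustrative):

```python
import glob

import pandas as pd

def read_splits(pattern):
    """Read every CSV matching `pattern` (in sorted order) and stack them into one frame."""
    paths = sorted(glob.glob(pattern))
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

# e.g. vehicle = read_splits('vehicle*.csv')
```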
The dimensions of the dataset are given below: we have 34677 observations of 26 variables.
vehicle.shape
The first 5 observations are given below.
vehicle.head()
Since the dataset is large, subsetting and plotting it directly is slow. Instead, we export it as a SQLite database and use SQL queries to do the subsetting, which is much faster. The code below converts the CSV into a SQLite file.
import csv, sqlite3
con = sqlite3.connect("vehicle.sqlite")
con.text_factory = str
cur = con.cursor()
cur.execute("CREATE TABLE vehicle (id, title, body, lat, long, posted, updated, drive, odometer, type, header, condition, cylinders, fuel, size, transmission, byOwner, city, time, description, location, url, price, year, maker, makerMethod);") # use your column names here
with open('vehicle1.csv', 'r') as fin:  # repeat for vehicle2.csv and vehicle3.csv to load the full dataset
    # csv.DictReader uses the first line in the file for column headings by default
    dr = csv.DictReader(fin)  # comma is the default delimiter
    to_db = [(i['id'], i['title'], i['body'], i['lat'], i['long'], i['posted'], i['updated'], i['drive'], i['odometer'], i['type'], i['header'], i['condition'], i['cylinders'], i['fuel'], i['size'], i['transmission'], i['byOwner'], i['city'], i['time'], i['description'], i['location'], i['url'], i['price'], i['year'], i['maker'], i['makerMethod']) for i in dr]
    cur.executemany("INSERT INTO vehicle (id, title, body, lat, long, posted, updated, drive, odometer, type, header, condition, cylinders, fuel, size, transmission, byOwner, city, time, description, location, url, price, year, maker, makerMethod) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?);", to_db)
con.commit()
con.close()
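Since the data is already in a DataFrame, the same conversion can be done in one call with pandas; a sketch (the helper name `frame_to_sqlite` is illustrative):

```python
import sqlite3

import pandas as pd

def frame_to_sqlite(df, db_path, table='vehicle'):
    """Write a DataFrame into a SQLite table, replacing the table if it exists."""
    con = sqlite3.connect(db_path)
    try:
        df.to_sql(table, con, if_exists='replace', index=False)
    finally:
        con.close()

# e.g. frame_to_sqlite(vehicle, 'vehicle.sqlite')
```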
Now we can start some interesting exploration of the dataset. First, let's create a connection to the SQLite file.
# Create the connection to the SQL file
sqlite_file = "./vehicle.sqlite"
conn = sqla.create_engine('sqlite:///' + sqlite_file)
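Before running queries, a quick sanity check that the vehicle table actually made it into the file can save debugging later. A sketch using SQLite's built-in sqlite_master catalogue (the helper name `list_tables` is illustrative):

```python
import pandas as pd

def list_tables(con):
    """Return the names of all tables in a SQLite database connection/engine."""
    q = "SELECT name FROM sqlite_master WHERE type='table';"
    return list(pd.read_sql_query(q, con)['name'])

# e.g. list_tables(conn) should include 'vehicle'
```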
What are the most common cars for sale by owner and by dealer?
sql_query = """
SELECT byOwner, count(*) as count
FROM vehicle
group by byOwner;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head(n=5)
selected_table.plot.bar(x='byOwner', y='count', title="Number of posts by owner and by dealer")
For the column byOwner, TRUE means the post was made by an owner and FALSE means it was made by a dealer. We can see that 4862 posts are from owners and 6697 are from dealers.
What is the average age of cars sold in each city?
Let's first take a look at the distribution of overall car age.
sql_query = """
SELECT 2017-year as age
FROM vehicle;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.plot.hist()
The histogram doesn't reveal the shape of the distribution because some of the computed car ages are extremely large. Let's explore that part. The top 5 car ages are listed below.
sql_query = """
SELECT 2017-year as age
FROM vehicle
order by age desc;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head()
There is a car with an age of 2013, likely from a malformed year field, so no wonder the range of ages is that large. Now, let's subset again and remove such values from the data before plotting the histogram.
sql_query = """
SELECT 2017-year as age
FROM vehicle
where age < 100;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.plot.hist()
After removing the outliers, we can see that the distribution of car age is right-skewed. Next, we want to see the average age of cars sold in each city.
sql_query = """
SELECT city, avg(2017-year) as avg_age
FROM vehicle
group by city
having avg_age < 100;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table
As we can see, the average car age is highest in Chicago and lowest in Denver.
How does the distribution of car age differ between posts by dealers and posts by owners?
sql_query = """
SELECT 2017-year as age
FROM vehicle
where age < 1000 and byOwner='FALSE';
"""
x = pd.read_sql_query(sql_query,conn)
x = x['age'].values  # as_matrix() was removed in newer pandas; use .values
sql_query = """
SELECT 2017-year as age
FROM vehicle
where age < 1000 and byOwner='TRUE';
"""
y = pd.read_sql_query(sql_query,conn)
y = y['age'].values
The histograms below show the distribution of car ages for sale by owner and by dealer.
bins = np.linspace(0, 90, 100)
plt.hist(x, bins, alpha=0.5, label='by dealer')
plt.hist(y, bins, alpha=0.5, label='by owner')
plt.legend(loc='upper right')
plt.show()
As we can see, the cars sold by owners are generally older than the cars sold by dealers. This makes sense, because dealers tend to stock younger cars that are easier to sell.
Explore the relationship between odometer and age of car
sql_query = """
SELECT odometer, 2017-year as age
FROM vehicle
where age < 100 and odometer != 'NA';
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head(n=5)
odometer = selected_table.odometer.astype(float)  # stored as text in SQLite
age = selected_table.age
The missing values have been removed from the odometer column. A histogram of odometer readings is given below.
odometer.plot.hist(title='histogram of odometer')
As we can see, it is hard to make out the distribution of odometer readings; there seems to be an outlier. Let's check for possible outliers.
sql_query = """
SELECT odometer, 2017-year as age
FROM vehicle
where age < 100 and odometer != 'NA'
order by CAST(odometer AS INTEGER) desc;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head(n=5)
So there is an odometer reading of 99999999, clearly a placeholder value. We should remove it from the dataset.
sql_query = """
SELECT odometer, 2017-year as age, price
FROM vehicle
where age < 50 and odometer != 'NA' and odometer != '99999999' and price != 'NA';
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head(n=5)
Since the odometer column has dtype object, we need to convert it to a numeric type before we can make the scatterplot.
odometer = selected_table.odometer.astype(int)
age = selected_table.age
price = selected_table.price.astype(float)
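The astype calls break if any stray non-numeric string slips past the SQL filter. A more forgiving alternative (a sketch, not what this notebook uses) is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so they can be dropped:

```python
import pandas as pd

def to_clean_numeric(series):
    """Coerce a string series to floats, silently dropping unparseable entries."""
    return pd.to_numeric(series, errors='coerce').dropna()
```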
def mean(numbers):
    """Arithmetic mean, guarding against division by zero on an empty list."""
    return float(sum(numbers)) / max(len(numbers), 1)

def stddev(lst):
    """Sample standard deviation (divides by n - 1)."""
    mn = mean(lst)
    total = sum((v - mn) ** 2 for v in lst)
    return (total / (len(lst) - 1)) ** 0.5
fig, ax = plt.subplots()
N = len(age)
x = odometer
y = age
colors = np.random.rand(N)
area = np.pi * (15000 * price / price.sum())**2  # marker radius scales with the car's share of total price
ax.scatter(x, y, s=area, c=colors, alpha=0.5)
ax.set_xlabel('odometer')
ax.set_ylabel('car age')
ax.set_title('Scatterplot of car age against odometer')
plt.show()
The scatterplot of car age against odometer is given above. The size of each circle is proportional to the car's selling price. We can see that only a few odometer readings are very large; most are below 300,000. Also, a few cars sell for a very large amount, since a few circles are much larger than the rest.
Where are the posts made? Let's visualize them on a map.
sql_query = """
SELECT city, count(*) as count
FROM vehicle
group by city;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head(n=5)
import folium
from folium.plugins import MarkerCluster  # MarkerCluster lives in folium.plugins in newer folium

post_info = pd.read_sql("SELECT city, lat, long FROM vehicle", conn)
lonlat1 = post_info[['lat', 'long']]
START_COORDINATES = (42.299098, -71.446701)  # near Boston
map_1 = folium.Map(START_COORDINATES, zoom_start=12)
marker_cluster = MarkerCluster().add_to(map_1)
for row in lonlat1.itertuples(index=False):
    folium.Marker([row.lat, row.long]).add_to(marker_cluster)
map_1
The interactive map is given above. We can use it to see where the posts were made. For example, a lot of posts come from AutoMax Preowned in the Boston area.
What are the top three car makes sold on the website?
sql_query = """
SELECT maker, count(*) as count
FROM vehicle
where maker != 'NA'
group by maker
order by count desc
limit 20;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head()
selected_table.plot.barh(x='maker', y='count', color='blue', title="Top 20 car makes by number of posts")
So, the top three car makes sold on the website, in descending order, are Ford, Chevrolet, and Toyota.
Now, let's try visualizing it with a word cloud. First, read all the car makers from the SQLite file.
sql_query = """
SELECT maker
FROM vehicle
where maker != 'NA';
"""
selected_table = pd.read_sql_query(sql_query,conn)
Next, we will load the wordcloud module.
from wordcloud import WordCloud, STOPWORDS
Then, we will calculate how often each car maker appears in the data.
freqs = selected_table.maker.value_counts().to_dict()  # generate_from_frequencies expects a {word: count} dict
wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
So, we can see the most common car makers are Chevrolet, Honda, Ford, Jeep, and Toyota.
What are the most common words appearing in the title and the body of a post? Are they the same?
# First take out the data we need, the title and body of the posts
sql_query = """
SELECT title, body
FROM vehicle;
"""
selected_table = pd.read_sql_query(sql_query,conn)
selected_table.head()
title = selected_table.title
body = selected_table.body
Next, we will take the titles, draw a word cloud, and inspect the most common words.
from functools import reduce  # reduce lives in functools in Python 3

title_list = [x.encode('ascii', 'ignore').decode().split() for x in title]  # drop non-ASCII, split each title into words
title_list = reduce(lambda a, b: a + b, title_list, [])  # flatten into one list
title_df = pd.DataFrame(title_list)  # make it a dataframe
title_df.columns = ['title']  # give it a column name
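Flattening with reduce and + copies the accumulator on every step, which is quadratic in the number of titles; itertools.chain does the same flattening in linear time (a sketch with an illustrative helper name):

```python
from itertools import chain

def flatten_words(texts):
    """Split each text on whitespace and flatten everything into one word list."""
    return list(chain.from_iterable(t.split() for t in texts))

# e.g. title_list = flatten_words(title)
```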
The wordcloud of the title of the post is given below.
freqs = title_df.title.value_counts().to_dict()  # generate_from_frequencies expects a {word: count} dict
wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
It seems that the most common words in titles are years and car model names.
Now, let's check the most common words in the body.
body_list = [x.encode('ascii', 'ignore').decode().split() for x in body]  # drop non-ASCII, split each body into words
body_list = reduce(lambda a, b: a + b, body_list, [])  # flatten into one list
body_df = pd.DataFrame(body_list)  # make it a dataframe
body_df.columns = ['body']  # give it a column name
The wordcloud for the body of the post is given below.
freqs = body_df.body.value_counts().to_dict()  # generate_from_frequencies expects a {word: count} dict
wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Clearly, we don't see years or car model names mentioned much here. So the common words used in titles are different from the common words used in bodies.
The vehicle sale posts were obtained from Craigslist. Since Craigslist does not offer an API, we would have had to scrape the posts to collect them ourselves; in the end, cleaning the scraped data took too much time, so we used another dataset provided by Duncan in one of his STA141 classes. The posts are mainly made by dealers. In general, cars sold by owners are older than cars sold by dealers. The scatterplot of car age against odometer shows that most odometer readings are below 300,000, and a few cars sell for very large amounts. Most of the cars on the site are made by Ford, Toyota, Chevrolet, Honda, and Jeep. From the interactive map, we can see that most posts come from Boston, Denver, and Chicago, and the cluster balloons show that many posts come from the same local dealers. All in all, the words most frequently used in post titles differ from those most frequently used in post bodies.